Here are all the packages needed to get started.
library(gt)
library(tidyverse)
library(gtsummary)
library(plotly)
library(readxl)
library(plotly)
library(corrplot)
sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] corrplot_0.92 readxl_1.4.2 plotly_4.10.1 gtsummary_1.7.1
## [5] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.2
## [9] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
## [13] ggplot2_3.4.2 tidyverse_2.0.0 gt_0.9.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 xfun_0.39 bslib_0.4.2
## [4] colorspace_2.1-0 vctrs_0.6.2 generics_0.1.3
## [7] htmltools_0.5.5 viridisLite_0.4.2 yaml_2.3.7
## [10] utf8_1.2.3 rlang_1.1.1 jquerylib_0.1.4
## [13] pillar_1.9.0 glue_1.6.2 withr_2.5.0
## [16] lifecycle_1.0.3 cellranger_1.1.0 munsell_0.5.0
## [19] gtable_0.3.3 htmlwidgets_1.6.2.9000 evaluate_0.21
## [22] knitr_1.42 tzdb_0.4.0 fastmap_1.1.1
## [25] fansi_1.0.4 scales_1.2.1 cachem_1.0.8
## [28] jsonlite_1.8.4 hms_1.1.3 digest_0.6.31
## [31] stringi_1.7.12 grid_4.2.1 cli_3.6.1
## [34] tools_4.2.1 magrittr_2.0.3 sass_0.4.6
## [37] lazyeval_0.2.2 pkgconfig_2.0.3 broom.helpers_1.13.0
## [40] data.table_1.14.8 xml2_1.3.4 timechange_0.2.0
## [43] rmarkdown_2.21 httr_1.4.6 rstudioapi_0.14
## [46] R6_2.5.1 compiler_4.2.1
In univariate analysis, we look at a single variable and describe 3 different things:
Helps explain where the middle of the data is. This can be measured in 3 main ways.
grades <- c(78,79,80,81,82)
mean(grades)
## [1] 80
median(grades)
## [1] 80
There isn’t an easy way of doing this, so I created a function instead.
getModes <- function(x) {
ux <- unique(x)
tab <- tabulate(match(x, ux))
ux[tab == max(tab)]
}
getModes(grades)
## [1] 78 79 80 81 82
Since there is no value that occurs most frequently, they all show. Mode is rarely used.
For this, I am going to randomly generate 100 exam scores.
grades <- rnorm(100,mean = 75, sd = 5 )
hist(grades)
You can mess with the breaks = arguement to get
different numbers of bins.
Let’s look at a different dataset. Starwars character heights.
var(starwars$height)
## [1] NA
What happened? We have nulls in our data, so we cannot calculate the variance until we tell the system to ignore null values
var(starwars$height, na.rm = TRUE)
## [1] 1208.983
sd(starwars$height, na.rm = TRUE)
## [1] 34.77043
IQR(starwars$height, na.rm = TRUE)
## [1] 24
range(starwars$height, na.rm = TRUE)
## [1] 66 264
You can visualize the spread as well this with a boxplot.
boxplot(starwars$height)
You can see that a lot of them fall outside of the fences.
You could also visualize this with a histogram
hist(starwars$height, breaks = "fd")
To get most of the items of interest, you can simply use the
summary() function. The standard deviation nor the variance
is displayed, so you would still need those.
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
When looking at 2 variables, we use correlation, scatterplots, and time series graphs.
cor(mtcars$mpg, mtcars$hp)
## [1] -0.7761684
These are strongly negatively correlated. We might say, horsepower has a inverse relationship with mpg, or mpg has a negative relationship with horsepower.
We can also visualize this relationship with a scatterplot.
plot(mtcars$hp, mtcars$mpg)
we can also make this look nicer with some labels
plot(mtcars$hp, mtcars$mpg,
main = "Scatterplot of Horsepower and Miles Per Gallon",
xlab = "Horsepower",
ylab = "Miles Per Gallon")
Since the data we’ve seen thus far has been only cross section, let’s
bring in the data from the previous module.
Wealth1Percent.xlsx
Wealth1percent <- read_excel("../Data/Wealth1percent.xlsx",
col_types = c("date", "numeric","numeric"))
To add the trend line, we need to do something that we haven’t discussed yet. Don’t worry, I’ll explain it in the next module.
plot(Wealth1percent$quarter,Wealth1percent$Share, type = "l")
abline(lm(Share ~ quarter, data = Wealth1percent),lty = 2)
Here are some more ways to display data
We can represent data in categorical fashion:
table(starwars$hair_color)
##
## auburn auburn, grey auburn, white black blond
## 1 1 1 13 3
## blonde brown brown, grey grey none
## 1 18 1 1 37
## unknown white
## 1 4
Or quantitative
bins <- seq(10,34,by = 2)
mpg <- cut(mtcars$mpg,bins)
table(mpg)
## mpg
## (10,12] (12,14] (14,16] (16,18] (18,20] (20,22] (22,24] (24,26] (26,28] (28,30]
## 2 1 7 3 5 5 2 2 1 0
## (30,32] (32,34]
## 2 2
n <- table(mtcars$gear)
barplot(n,xlab="Number of Gears")
We can add scale = 3 to make this line up properly
stem(mtcars$hp, scale = 3)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 | 2
## 6 | 2566
## 7 |
## 8 |
## 9 | 1357
## 10 | 59
## 11 | 0003
## 12 | 33
## 13 |
## 14 |
## 15 | 00
## 16 |
## 17 | 555
## 18 | 000
## 19 |
## 20 | 5
## 21 | 5
## 22 |
## 23 | 0
## 24 | 55
## 25 |
## 26 | 4
## 27 |
## 28 |
## 29 |
## 30 |
## 31 |
## 32 |
## 33 | 5
Check out this fun stuff! Makes things look much cleaner.
Categorical
hair <- table(starwars$hair_color)%>%
data.frame()
colnames(hair)[1] <- "Hair Color"
gt(hair)
| Hair Color | Freq |
|---|---|
| auburn | 1 |
| auburn, grey | 1 |
| auburn, white | 1 |
| black | 13 |
| blond | 3 |
| blonde | 1 |
| brown | 18 |
| brown, grey | 1 |
| grey | 1 |
| none | 37 |
| unknown | 1 |
| white | 4 |
Or quantitative
bins <- seq(10,34,by = 2)
mpg <- cut(mtcars$mpg,bins)
table(mpg)%>%
data.frame() %>%
gt()
| mpg | Freq |
|---|---|
| (10,12] | 2 |
| (12,14] | 1 |
| (14,16] | 7 |
| (16,18] | 3 |
| (18,20] | 5 |
| (20,22] | 5 |
| (22,24] | 2 |
| (24,26] | 2 |
| (26,28] | 1 |
| (28,30] | 0 |
| (30,32] | 2 |
| (32,34] | 2 |
Using our classic mtcars dataset.
mtcars%>% select(mpg, cyl,hp) %>%
tbl_summary(statistic = list(all_continuous() ~ c("{mean} ({sd})",
"{median} ({p25}, {p75})",
"{min}, {max}"),
all_categorical() ~ "{n} / {N} ({p}%)"),
type = all_continuous() ~ "continuous2"
)
| Characteristic | N = 321 |
|---|---|
| mpg | |
| Â Â Â Â Mean (SD) | 20.1 (6.0) |
| Â Â Â Â Median (IQR) | 19.2 (15.4, 22.8) |
| Â Â Â Â Range | 10.4, 33.9 |
| cyl | |
| Â Â Â Â 4 | 11 / 32 (34%) |
| Â Â Â Â 6 | 7 / 32 (22%) |
| Â Â Â Â 8 | 14 / 32 (44%) |
| hp | |
| Â Â Â Â Mean (SD) | 147 (69) |
| Â Â Â Â Median (IQR) | 123 (97, 180) |
| Â Â Â Â Range | 52, 335 |
| 1 n / N (%) | |
ggplot(mtcars, aes(mpg))+
geom_histogram(binwidth = 2,col = 'black', fill = 'darkblue', alpha = 0.75)+
labs(title = 'Distribution of Miles Per Gallon', caption = "1974 Motor Trend US Magazine")+
theme_bw()
Since this is an interactive graph, it will not show up in a .pdf file, but it is great to look at in an .html document.
plot_ly(x = ~mtcars$mpg, type = "histogram", alpha = 0.6) %>%
layout(title = 'Distribution of Miles Per Gallon',
xaxis = list(title = 'Miles Per Gallon'),
yaxis = list(title = 'Count'))
I think the plotly version of a boxplot is superior since it is interactive.
plot_ly(y = starwars$height, type = 'box', name = 'Height [cm]',text = starwars$name) %>%
layout(title = 'Distribution of Star Wars Character Heights')
reduced <- mtcars %>%
select(mpg, hp, wt,qsec,disp,drat)
corrplot(cor(reduced),
type = "lower",
order = "hclust",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
diag = FALSE)
ggplot(starwars,aes(height, mass))+
geom_point(color = 'gray40')+
theme_bw()+
labs(title = "Relationship between Mass and Height of Star Wars Characters")
plot_ly(starwars, y = ~mass, x = ~height, type = 'scatter',text = ~name, mode = "markers")
ggplot(Wealth1percent, aes(quarter, Share))+
geom_line(color = 'gray40',alpha = 0.75)+
geom_smooth(method = "lm", se = F, color = 'darkblue', linetype = 'dashed')+
theme_bw()+
labs(title = "Share of Total Net Worth Held by the Top 1%",
subtitle = 'from 1989-2022',
x = "Date",
y = "Share")
To add the trend line, it gets quite tricky.
model <- lm(Share~quarter,Wealth1percent)
Trend <- predict(model,data = Wealth1percent$quarter)
plot_ly(Wealth1percent, x = ~quarter, y = ~Share, type = 'scatter', mode = 'lines', name = "Share") %>%
add_trace(y = ~Trend, name = 'Trend', mode = 'lines')